# Write your code here to read in the file
# How do you examine the data? What approaches can you think of? Let's try them!
Lecture 03
Lecture 2: Review
- We covered inductive vs deductive reasoning
- How to begin to ask questions
- Accuracy and precision
- What are general types of data
- How to set up an R project in RStudio
- How to install and load libraries
- How to read a file into R
- How to make a graph
Our first graph
Lecture 3: How to deal with data wrangling
- Data management overview
- How to make a tidy spreadsheet
- Metadata - why you really should use it
- Data repositories
- R in practice
New image here
Lecture 3: Data management overview
- Data: the raw material of science
- Wide variety of formats, sizes, complexity
- Data management and curation are often underemphasized
- Good data management: we owe it to our funding agencies, colleagues, supervisors, and study systems
Lecture 3: Data gathering - managing
Step 1
- Decide what type of data you are collecting
- Decide on a controlled vocabulary, e.g., odo_mgl, drp_ugl
- Decide what has to happen in the data flow
- Organize your project
- Enter the data as soon as you can
  - in a spreadsheet, saved as Excel and CSV
  - make sure it is tidy
Tidy data by Wickham
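As a minimal sketch of what "tidy" means in practice, the example below reshapes a small wide table (one column per sampling date) into one row per observation with tidyr's pivot_longer(). The data frame, site names, and values are made up for illustration; only the variable name odo_mgl comes from the controlled vocabulary above.

```r
library(tidyr)

# Hypothetical "wide" field sheet: one row per site, one column per sampling date
wide_df <- data.frame(
  site = c("inflow", "outflow"),
  `2025_02_01` = c(8.1, 7.4),
  `2025_03_01` = c(9.0, 7.9),
  check.names = FALSE
)

# Tidy version: one row per site-date observation, one column per variable
tidy_df <- pivot_longer(
  wide_df,
  cols = -site,
  names_to = "sample_date",
  values_to = "odo_mgl"
)

print(tidy_df)
```

Each variable (site, date, measurement) now has its own column, which is the layout ggplot and most modeling functions expect.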
Step 2:
- Make a metadata sheet
  a. data about data
  b. descriptions, units, etc.
Let’s recreate the basic histogram of fish lengths from our last class. Use the sculpin_df data frame that’s already loaded.
Lecture 3: Data gathering - managing
Step 3
- Store your raw data and metadata
- Electronic data files should be stored in ≥3 copies:
  - Your computer (onsite)
  - External storage (onsite)
  - Offsite storage (e.g., cloud storage)
- Have a regular backup strategy
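One possible way to script a backup copy from R is sketched below. The folder names are hypothetical, and tempdir() stands in for a real project folder and external drive so the example can run anywhere.

```r
# Hypothetical locations; tempdir() stands in for real project and backup drives
project_dir <- file.path(tempdir(), "project")
backup_dir  <- file.path(tempdir(), "external_drive")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(backup_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in raw data file
raw_file <- file.path(project_dir, "data", "raw_data.csv")
write.csv(data.frame(site = "inflow", odo_mgl = 8.1), raw_file, row.names = FALSE)

# Copy the raw file to the backup location with the date in the file name
backup_file <- file.path(
  backup_dir,
  paste0(format(Sys.Date(), "%Y_%m_%d"), "_raw_data.csv")
)
copied <- file.copy(raw_file, backup_file, overwrite = FALSE)

# Confirm the copy matches the original byte for byte
same <- unname(tools::md5sum(raw_file) == tools::md5sum(backup_file))
```

A script like this covers only the onsite copy; the offsite copy still depends on whatever cloud or institutional storage you use.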
Lecture 3: Data gathering - managing
Step 4
- Graph your data and check outliers, errors, missing data
- You can code missing values as NA or leave the cell blank; opinions differ
  - you can set this when you import data
# Specifying NA values explicitly
library(readr)
data <- read_csv("your_file.csv",
                 na = c("", "NA", "N/A", "missing", "null"))
Let’s recreate the basic plots you might use to visualize the data and see what they look like.
# Let's practice making a plot!
# In what ways do you want to see the data? Let's try them!
Lecture 3: Data gathering - managing
Step 5
Cleaning data
- Correct errors, fill missing data with “NA”, resolve outliers
- Save a clean data file as the master file; it is often good to make it read-only
- Add information in a notes column or text file about what was done and why
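The cleaning steps above might look something like the following sketch. The data frame, the column names, and the use of 9999 as a missing-value code are assumptions for illustration, and tempdir() stands in for the project folder.

```r
# Hypothetical raw data: 9999 was used as a missing-value code and one weight is blank
raw_df <- data.frame(
  fish_id   = 1:4,
  length_mm = c(52, 9999, 61, 48),
  weight_g  = c(1.8, 2.1, NA, 1.5)
)

clean_df <- raw_df
# Replace the sentinel code with NA so it cannot distort summaries
clean_df$length_mm[clean_df$length_mm == 9999] <- NA
# Record what was done and why in a notes column
clean_df$notes <- ifelse(raw_df$length_mm == 9999, "length 9999 recoded to NA", NA)

# Save the cleaned file as the master copy and make it read-only
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE)
clean_file <- file.path(out_dir, "cleaned_raw_data.csv")
write.csv(clean_df, clean_file, row.names = FALSE)
Sys.chmod(clean_file, mode = "0444")  # read-only on Unix-alikes; limited effect on Windows
```

The raw data frame is never modified; all changes are made on a copy, and the notes column documents each one.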
Lecture 3: Data gathering - managing
Step 6
- Time to graph the data and explore, summarize, and transform data
- If cleaning, transformations, and calculations take many steps, save the results to a new output file.
A good way to organize script files is to number them in the order they are run.
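As a rough illustration of numbered scripts, the sketch below writes three tiny scripts to a temporary folder and runs them in order. The script names and their contents are hypothetical.

```r
# Hypothetical project: write three tiny numbered scripts into a temporary folder
script_dir <- file.path(tempdir(), "scripts")
dir.create(script_dir, showWarnings = FALSE)
writeLines("steps <- c(steps, 'import')", file.path(script_dir, "01_import_data.R"))
writeLines("steps <- c(steps, 'clean')",  file.path(script_dir, "02_clean_data.R"))
writeLines("steps <- c(steps, 'plot')",   file.path(script_dir, "03_plot_data.R"))

# Zero-padded numeric prefixes sort correctly as text, so the listing is the run order
scripts <- sort(list.files(script_dir, pattern = "^[0-9]{2}_.*\\.R$", full.names = TRUE))

steps <- character(0)
for (script in scripts) {
  source(script, local = TRUE)  # run each script in the current environment, in order
}
print(steps)
```

Zero-padding matters: without it, a script named 10_... would sort before 2_....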
Lecture 3: Data gathering - managing
The important considerations in data
- enter field/lab data into electronic format as soon as possible and back it up in at least one location (e.g., cloud storage)
- do not modify raw data in any way following entry into electronic format
- store all data in an open-access format (e.g., .csv)
- thoroughly check and clean your raw data, saving it as a separate file (e.g., “output/cleaned_raw_data.csv”)
- accompany raw field/lab data with meta-data that is unambiguously linked to the raw data file
- carry out all analyses, calculations, and visualization on a separate file from the “raw” or “clean” data: the “analysis” data
- perform all data transformation, analysis, and visualization with reproducible code, and store the code together with the data
- arrange all raw and analysis data in “instance-row, variable-column” or tidy format: one column per variable
USE CONTROLLED VOCABULARY AND BE CONSISTENT
THINK BEFORE DOING -> WHAT HAPPENS DOWN THE ROAD
Lecture 3: Data gathering - managing
Broman KW, Woo KH. 2018. Data organization in spreadsheets. The American Statistician 72: 2-10
- Spreadsheets break data - use with extreme caution
- Spreadsheets: data entry and storage
- R: visualization and analysis
- Goal: organize data so readable by humans and computers
Lecture 3: Data gathering - managing
Be consistent!
- Variable names
  - use snake case and lower case, e.g., nitrate_n_mgl
  - always use the same name
- Codes for categorical variables
- Codes for missing values: NA, 9999, or a blank (I know, but I do it)
- Date formats
  - YYYY-MM-DD HH:MM:SS
  - time begins at 1970-01-01 (the origin R counts dates from)
- Names of objects
  - data frames after import: data_df
  - plots: len_wt_plot
  - models: anova_wt_model
- File names
  - use separators: 2025_02_01_lake_x_inflow.csv
- Note format
Requires considerable foresight and organization
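A short base R sketch of the date conventions above: it shows that an ISO-style date parses without extra arguments, that R counts dates from 1970-01-01, and how a date can feed a consistent file name. The specific date and time are arbitrary examples.

```r
# ISO 8601 dates (YYYY-MM-DD) parse without a format string
sample_date <- as.Date("2025-02-01")

# R stores a Date as the number of days since 1970-01-01
as.numeric(as.Date("1970-01-01"))  # 0
days_since_epoch <- as.numeric(sample_date)

# Date-times are stored as seconds since 1970-01-01 00:00:00 UTC
sample_time <- as.POSIXct("2025-02-01 14:30:00", tz = "UTC")
seconds_since_epoch <- as.numeric(sample_time)

# Build a consistent file name from the date (file name pattern from the list above)
file_name <- paste0(format(sample_date, "%Y_%m_%d"), "_lake_x_inflow.csv")
file_name  # "2025_02_01_lake_x_inflow.csv"
```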
example of fish data
So there are two issues:
1. What can you do when reading in the file?
2. What can you do once the file is in and you need to fix things?
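One way the two issues might be approached is sketched below, using dplyr's rename() for a single hand fix and janitor's clean_names() to standardize every column name right after import. The messy data frame and its column names are invented stand-ins for a real messy file.

```r
library(dplyr)
library(janitor)

# Hypothetical messy column names, standing in for a freshly imported messy file
messy_df <- data.frame(
  `Site ID` = c("A", "B"),
  `Fish Length (mm)` = c(52, 61),
  check.names = FALSE
)

# The file is already in R: rename one column by hand (new_name = old_name)
renamed_df <- messy_df %>%
  rename(site_id = `Site ID`)

# Right after import: clean_names() converts every column name to snake_case at once
clean_df <- messy_df %>%
  clean_names()

names(clean_df)
```

rename() is handy for one or two columns; clean_names() scales better when a whole file arrives with spaces, capitals, and symbols in its headers.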
# let's do #2 first - no pun intended
# if you wanted to rename variables, what would you do?
# now time for #1 - there are tools to make your life easier
# install.packages("janitor") # what does a janitor or BSW do?
# library(janitor)
# let's read in a messy file... junk.csv
# first look at the file
# df <- read_csv("data/junk.csv")
# library(readxl)  # provides read_excel()
# df_excel <- read_excel("data/junk.xlsx")
Lecture 3: Data gathering - managing
Variable and file names can be a problem
- Avoid spaces; use underscores instead
- Avoid special characters such as @#$%^
- Use a variety of separators so you can split the name apart later
- or use the same number of characters for each part of the name
  - e.g., 2025_03_04_file-site
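To show why mixed separators pay off later, here is a small base R sketch that pulls the example file name apart again. It assumes the hyphen always sets off the site and the underscores separate the date parts and label.

```r
# File name pattern from the text: parts joined by "_", site set off by "-"
file_name <- "2025_03_04_file-site"

# Split on the hyphen first to isolate the site label
parts <- strsplit(file_name, "-", fixed = TRUE)[[1]]
site  <- parts[2]

# Then split the left-hand part on underscores to get the date pieces and label
left <- strsplit(parts[1], "_", fixed = TRUE)[[1]]
file_date <- as.Date(paste(left[1:3], collapse = "-"))
label <- left[4]
```

Because the two separators mean different things, the site can contain no underscores and still be recovered cleanly.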
Lecture 3: Data gathering - managing
Excel will drive you mad
- it will mess up your dates
  - store data in separate columns: year, month, day
  - or use a string: 20250401
- always use an unambiguous format, from largest to smallest unit - why?
Is 01 04 2025 the same as 04 01 2025? What are the dates in English, or in European format?
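The ambiguity can be demonstrated directly in base R. This sketch parses the same text under both conventions and then shows that largest-to-smallest date strings sort in time order.

```r
date_text <- "01 04 2025"

# Read as day-month-year (common in Europe): 1 April 2025
as_dmy <- as.Date(date_text, format = "%d %m %Y")

# Read as month-day-year (common in the US): 4 January 2025
as_mdy <- as.Date(date_text, format = "%m %d %Y")

# The same text gives two different dates
as_dmy == as_mdy  # FALSE

# Largest-to-smallest text sorts chronologically, even as plain character strings
sorted_dates <- sort(c("2025-04-01", "2024-12-31", "2025-01-04"))
sorted_dates
```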
Let’s look at junk.xlsx
# Write your code here to create a histogram of fish lengths from Toolik Lake
# Remember to use the pipe operator %>% and ggplot with geom_histogram()
# copy the date to a new cell and make it a number!
Lecture 3: Data gathering - managing
Never do calculations in Excel
- always do calculations in R - reproducible
- never merge cells
- you can use highlighting, but it will disappear (e.g., when the file is saved as .csv)
- a nice rectangular data frame will make you happy
- tears will flow if not
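As a minimal sketch of doing a calculation in R rather than in the spreadsheet, the example below converts a concentration from µg/L to mg/L in code. The values and site names are made up; drp_ugl follows the controlled vocabulary used earlier.

```r
# Hypothetical raw values; drp_ugl follows the controlled vocabulary used earlier
raw_df <- data.frame(
  site    = c("inflow", "outflow"),
  drp_ugl = c(12.5, 4.0)
)

# Do the unit conversion in code so the step is recorded and repeatable:
# 1 mg/L = 1000 ug/L
analysis_df <- raw_df
analysis_df$drp_mgl <- analysis_df$drp_ugl / 1000

# raw_df is untouched; analysis_df carries the derived column
analysis_df
```

The conversion factor lives in the script, so anyone rerunning it gets the same numbers and can see exactly what was done.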
Lecture 3: Data gathering - managing
Meta Data
- This data will live beyond you
- Someone will need to interpret it - what do they need
- What the data is about
- Who collected it
- When
- Where
- Funding agency
- Methods used to collect
- Variable names
  - description
  - units
  - abbreviations
- CALCULATIONS AND WHY?
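A metadata sheet covering the items above could be built as a simple data dictionary, as in the sketch below. Every variable name, description, and study field here is a hypothetical placeholder, and tempdir() stands in for the project folder.

```r
# Hypothetical data dictionary: one row per variable in the data file
data_dictionary <- data.frame(
  variable    = c("site", "sample_date", "length_mm", "weight_g"),
  description = c("Sampling site code", "Date fish was measured",
                  "Total length of fish", "Wet weight of fish"),
  units       = c(NA, "YYYY-MM-DD", "mm", "g"),
  stringsAsFactors = FALSE
)

# Hypothetical study-level fields; fill in the who, when, where, and how
study_info <- c(
  data_file      = "cleaned_raw_data.csv",
  collected_by   = "name of collector",
  funding_agency = "name of agency",
  methods        = "short description or citation of methods"
)

# Every variable should have a description before the file is shared
stopifnot(!any(is.na(data_dictionary$description)))

# Store the dictionary next to the data so the two stay linked
meta_file <- file.path(tempdir(), "cleaned_raw_data_metadata.csv")
write.csv(data_dictionary, meta_file, row.names = FALSE)
```

Naming the metadata file after the data file is one simple way to keep the two unambiguously linked.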
We need to know what happened and why, the units, and WTF it means.
TGW - yep, it's a thing
ODO - what do you think it is?
NO3 - what is it? Are you sure? Why might you get in legal trouble if you used this?